
Integrate GitLink into the Roslyn build #15592

Merged Nov 30, 2016
Conversation

jaredpar
Member

This adds source server support to our binaries by means of GitLink.

@jaredpar
Member Author

CC @jasonmalinowski, @Pilchie, @dotnet/roslyn-infrastructure

jaredpar added a commit to jaredpar/GitLink that referenced this pull request Nov 29, 2016
Here is the PR where GitLink is being added to our official build process:

dotnet/roslyn#15592
@Pilchie
Member

Pilchie commented Nov 29, 2016

Is this all it takes?

@@ -25,6 +25,8 @@

<MSBuild Projects="$(ProjectDir)Roslyn.sln" BuildInParallel="true" Properties="DeployExtension=false" />

<Exec Command="$(NuGetPackageRoot)\GitLink\$(GitLinkVersion)\lib\net45\GitLink.exe $(ProjectDir) -u https://github.com/dotnet/roslyn -f Roslyn.sln -c $(Configuration)" />
Member

Are we still tracking the "make this run for PRs once we fix the perf issues"? I wholeheartedly approve this and don't want to block, but still want to make sure we're tracking the improvement somewhere.

Member Author

Let's chat about this in person. The perf issue is no longer relevant with GitLink. But I'm not sure it's going to have the effect we need.


PR's are going to be difficult, see GitTools/GitLink#124


About performance: @AArnott is helping us work on a PDB-only version. This should no longer require the tool to load the solution, making it much faster. This will be v3+.

Member Author

@GeertvanHorrik can you clarify why you think PRs are going to be difficult here? Reading that bug, it seems to be about finding the target URL. The URL is specified explicitly here, so detection shouldn't be needed.

The reason we want this enabled for PRs is crash dump investigation. Our tests are configured such that, in certain failure cases, they archive our EXE / PDB combo for future investigation. It's much easier to farm that investigation out to the relevant developer (not necessarily the person who made the PR) if we have source server support in the PDB.


@jaredpar It depends per source server, but in GitHub it's hard to find the actual raw files on the server (since it's a merge of both original repository + fork). I've been doing some testing, but couldn't find a reliable way to find the actual raw files for a PR.

If you have a reliable way to specify the URL (I don't think the URL specified in this script will point to the right PR files), I would love to hear about it.

I do understand why you want it for PRs; it's a rare use case for most people, but it makes total sense.

@jaredpar jaredpar Dec 2, 2016

@GeertvanHorrik GitLink should know the following two pieces of information:

  • GitHub repo URL
  • Commit SHA

Given that any file should be reachable at the following:

<github repo url>/blob/<git sha>/<relative path to file> 

This logic doesn't need to take into account forks in the PR case. By virtue of creating the PR, the corresponding commit SHA will be available in the original repo (just verified to be sure).
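The URL pattern above can be sketched as a small helper. This is purely an illustration of the scheme jaredpar describes, not GitLink's actual code; the function name and the sample path are invented:

```python
def blob_url(repo_url: str, sha: str, rel_path: str) -> str:
    """Build <github repo url>/blob/<git sha>/<relative path to file>."""
    # Normalize Windows-style separators to the forward slashes GitHub expects.
    rel_path = rel_path.replace("\\", "/")
    return f"{repo_url.rstrip('/')}/blob/{sha}/{rel_path}"

url = blob_url("https://github.com/dotnet/roslyn",
               "1fa93a6",
               r"src\Compilers\CSharp\Portable\Parser.cs")
# url == "https://github.com/dotnet/roslyn/blob/1fa93a6/src/Compilers/CSharp/Portable/Parser.cs"
```

Because the PR's head commit is reachable in the upstream repo, the same pattern works for PR builds without fork-specific logic.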

@jaredpar
Member Author

@Pilchie yes this is all it takes and yes it seems far too easy. I actually spent 5 minutes getting this all wired up and then a couple hours trying to get an appropriate PDB dumping utility to verify it had actually done the right thing. Was much too easy ...

@jaredpar jaredpar merged commit 1fa93a6 into dotnet:master Nov 30, 2016
@jaredpar jaredpar deleted the fix-link branch November 30, 2016 00:30
@AArnott
Contributor

AArnott commented Dec 1, 2016

@jaredpar said:

The perf issue is no longer relevant with GitLink.

Why is that? Oh, I think it's that GitLink doesn't (yet) normalize capitalization of paths per the git repo. So you get faster perf, but at the risk that some source files will fail to be retrieved while debugging. Of course, I'm optimistic we can have both great perf and accurate paths. As @GeertvanHorrik suggested, when we get the "big PR" merged into GitLink, hopefully we can retain the best of everything.

@jaredpar: After the perf problem in PdbGit is solved (and merged into GitLink) will you be interested in the fact that the nuget package reference in project.json is all you need to update your PDBs (you can drop the Exec task invocation in your msbuild script)? Or is there something appealing to your Exec approach that will lead you to retain it? I ask because @GeertvanHorrik and I were having a conversation about it on another PR where we were speculating and I'd love your thoughts on it.

@jaredpar
Member Author

jaredpar commented Dec 2, 2016

@AArnott

Oh, I think it's that GitLink doesn't (yet) normalize capitalization of paths per the git repo.

Correct. PdbGit was iterating the index once per file path being normalized. Visiting each item in the index allocates a fixed number of strings (I think it was 4-5). That's probably fine for most repos. But Roslyn is a huge repo, and many of our PEs have a considerable number of files in them. All of that combined into a lot of GC churn.

Of course, I'm optimistic we can have both great perf and accurate paths.

Completely agree. The fix seems pretty straightforward: iterate the index once, store it in a map, and use that for normalization.
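The single-pass fix described here can be sketched in Python. This is illustrative only; GitLink/PdbGit are C# tools, and the function names below are invented:

```python
def build_normalization_map(index_paths):
    """One pass over the git index: map lowercased path -> canonical path.

    Replaces the per-file index walk (O(files * index)) with one walk
    plus O(1) lookups.
    """
    return {p.lower(): p for p in index_paths}

def normalize(path, index_map):
    """Return the repo's canonical casing for a path, or the path unchanged."""
    return index_map.get(path.lower(), path)

index = ["src/Compilers/Core/Portable/Compilation.cs"]
index_map = build_normalization_map(index)
canonical = normalize("SRC/COMPILERS/Core/Portable/compilation.cs", index_map)
# canonical == "src/Compilers/Core/Portable/Compilation.cs"
```

The map costs one allocation pass up front, so the per-PDB work no longer scales with index size times file count.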

After the perf problem in PdbGit is solved (and merged into GitLink) will you be interested in the fact that the nuget package reference in project.json is all you need to update your PDBs

Why would that be the case, though? I don't see a reason why individual developers should be adding source server information to their PDBs. It's only really relevant when building in a non-local environment: Jenkins, official builds, etc. Hence I would at least condition running GitLink on that.

@AArnott
Contributor

AArnott commented Dec 2, 2016

Well, if the perf impact was minimal, what if it happened in all Release builds by default, but could be controlled by an msbuild property or environment variable? I'm a fan of "install the package and it just works" so I'm looking for a way to make that work for everybody.

@GeertvanHorrik

Not everyone builds packages via a build server / continuous integration. Although I always strongly suggest people do so, I often see NuGet packages built as manual release builds, or worse, debug builds.

Therefore I do agree that integrating this as an MSBuild task would work for those people, so I definitely see a use case here, @AArnott. But for people who have a build server, we recommend running this on the build server (once, instead of per assembly after compilation). The advantage of a single run at the end of a full compilation is that you don't pay the cost of starting apps, initialization, etc.

So basically we will have 2 use-cases for the tool:

  1. msbuild task => should only run in release by default, for the people not using build servers, a bit slower since each assembly will execute separately
  2. command line tool => will iterate all pdb files (with ignore patterns), for the people using a build server, will be a bit faster since we can re-use some startup logic (if any)

I would strongly suggest Roslyn uses 2 (which you are already doing right now). But also with GitLink 3+ (which will have the big PR merged), we would still recommend 2.

@jaredpar
Member Author

jaredpar commented Dec 2, 2016

One item I'd like more information on. Both of you seem to agree that GitLink should run for Release builds. To me that implies you want this to happen even on local developer machines. My question is: why?

For a local developer build, the relevant SHA isn't guaranteed to be available on GitHub.com. It could be a local throwaway commit, later rebased, scratch work, etc. Hence adding source server information to the PDB is potentially adding invalid data: it asks the debugger to fetch source data which likely isn't there yet.

Perhaps this is just part of the source server workflow that I don't have a good grasp on. But my intuition is that it should only be added in cases where the corresponding SHA is known, or at least very likely, to be present on GitHub.com. This case is very easy for us to detect and then enable in our build scripts: CI runs and official builds.
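The detection jaredpar mentions can be as simple as checking environment variables that CI systems set. A sketch, with the caveat that the exact guard variables are an assumption here (JENKINS_URL is set on Jenkins agents and TF_BUILD on VSTS/TFS build agents, but a given build script may use its own flags):

```python
import os

def running_in_ci(env=None):
    """Heuristic guard: are we in a CI/official build where the SHA is published?

    JENKINS_URL is defined on Jenkins build agents; TF_BUILD on VSTS/TFS agents.
    """
    env = os.environ if env is None else env
    return any(key in env for key in ("JENKINS_URL", "TF_BUILD"))

# Example with an explicit environment dict:
running_in_ci({"JENKINS_URL": "http://ci.example/"})  # True: run GitLink
running_in_ci({})                                     # False: skip GitLink locally
```

The same condition is trivially expressible as an MSBuild `Condition` attribute on the `Exec` task shown earlier in the thread.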

@jaredpar
Member Author

jaredpar commented Dec 2, 2016

@GeertvanHorrik i think there is one other case for an MSBuild task to consider.

I will give a caveat first though: I can be a bit OCD about reducing chances for errors and this is probably one of those cases 😄

The output folder even in the simplest of projects will contain two copies of the PDB file: the obj and the bin directory. In more complex projects, such as Roslyn, it's possible that the PDB file actually exists more than twice in the bin directory due to dependency copying and deployment issues.

The current GitLink workflow will only augment one of these files: the initial copy in the bin directory. That means there is at least one copy of the PDB file under the output directory that is "bad".

This opens the door for other tooling errors down the road. Every tool must now know that the version in the bin directory is the one to deploy. Or in more complex builds, it must know which copy in the bin directory is the one to deploy.

Integrating GitLink into an MSBuild task is a potential solution to this problem. If the task ran after build but before copying to the output directory then it would mean all copies of the PDB in the output directory were "good" / contained source server information. No guessing or large scale communication needed. See a PDB, use the PDB

Again, I'm a bit OCD about items like this so adjust the feedback appropriately 😄

@AArnott
Contributor

AArnott commented Dec 2, 2016

+1 to @jaredpar's comments. I can't tell you how many times we've had bugs due to modifying only some outputs (whether to sign, or index PDBs, etc.) due to the copying aspect. My personal goal is to make this rewrite so fast that folks won't complain if it runs for every release build.

@GeertvanHorrik said:

The advantage of having a single app at the end of a full compilation is that you don't have the cost of starting apps / initialization / etc.

That's not really an issue for an MSBuild task: it all runs in-proc in the MSBuild process, so no matter how many projects are built, GitLink pays its init cost only once (multiplied by the number of MSBuild nodes you have running).

@jaredpar said:

But my intuition is that it should only be added in cases where the corresponding SHA is known, or at least very likely, to be present on Github.com.

That's a very interesting idea: GitLink could check during the build whether the commit exists on the remote and only modify the PDB in that case. That will automatically filter out most dev builds where developers are iterating locally, amending commits, etc. It won't weed out cases where developers do local release builds of publicly available commits, but I suspect that's a far smaller percentage of users. And again, if the build is not significantly slower, I think we've hit the sweet spot between impact and "just working".

As for the risk of modifying a PDB with a commit ID that won't be around for long, I don't see that as any worse than not modifying it at all (so long as perf doesn't suffer). Do you?

@tmat
Member

tmat commented Dec 3, 2016

GitLink could check during the build whether the commit exists on the remote

Let's not make local build dependent on connectivity to the Internet.

I don't see the value of GitLink rewriting the PDBs when building locally (vs. on CI/build server).

@GeertvanHorrik

Good points about the multiple PDB files in multiple locations. When packaging, I always pick the assemblies from their source directories. For example, in one project I have:

  • Catel.Core
  • Catel.MVVM => references Catel.Core

Whenever I pack Catel.Core, I use the Catel.Core binaries, never the copies referenced from Catel.MVVM's output. But I understand your concerns; I lean toward OCD as well, but at what cost? If you are packaging assemblies from directories they are not written to, aren't you afraid of other side effects (out-of-date builds, etc.)?

Anyway, back to the issue: this is a major plus for integrating with MSBuild. As @AArnott points out, we probably don't pay the high cost of instantiating executables each time. I initially thought that was causing the slowdown, but if I understand correctly it's because of the path handling as it was implemented in PdbGit.

I agree we should never do internet checks during builds. We could, in theory, check the local git repository. If a build represents a commit, that commit is unique; whether it's pushed or not, it will never have a different commit ID. That means the PDB will always have the right URL, and the only "risk" we take is whether the commit has actually been pushed. In that rare case, you get the same behavior as before: no redirection to the source files. I've been thinking about pointing to the previous (last known) commit on a branch so at least most (unchanged) files are still correct, but this could result in very weird behavior (and lots of wrong, outdated files). I suggest we don't go there and just accept the 404 returned by the server at that point.

@jaredpar
Member Author

jaredpar commented Dec 5, 2016

Let's not make local build dependent on connectivity to the Internet.

I agree. I don't think this is necessary for our scenarios. There are easy ways for us to detect we are running in Jenkins / Official build (already in place and being used). Those are a sufficient guard to knowing the SHA is published.

but if I understand correctly it's because of the path handling as it was implemented in PdbGit.

That was definitely the case for us. GitLink actually ran fast enough that I was concerned it was failing silently. Spent a few hours cracking open the PDBs to make sure source server information was being added 😄

@AArnott
Contributor

AArnott commented Dec 5, 2016

Let's not make local build dependent on connectivity to the Internet.

We don't have to. Git locally knows what commits were retrieved from or pushed to remotes. It's a trivial local lookup to know that a commit is local-only or not.
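The lookup AArnott describes can be modeled as a pure check against the set of commits reachable from remote-tracking refs. In a real tool that set would come from the local repo (e.g. libgit2, or `git branch -r --contains <sha>`); this sketch and its names are hypothetical:

```python
def commit_is_public(sha, remote_reachable):
    """True if the commit is reachable from some remote-tracking ref,
    i.e. it was pushed to (or fetched from) a remote. No network needed:
    remote_reachable is gathered by walking refs/remotes/* locally."""
    return sha in remote_reachable

remote_reachable = {"1fa93a6", "a1b2c3d"}  # stand-in for the local walk
commit_is_public("1fa93a6", remote_reachable)  # True: safe to embed source links
commit_is_public("0000000", remote_reachable)  # False: local-only, skip GitLink
```

This only answers "was this commit ever synced with a remote", which is exactly the guard needed to avoid writing URLs for local-only commits.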

@AArnott
Contributor

AArnott commented Dec 12, 2016

I am working on a perf fix for the path normalization. It should be lightning fast compared to the way I had it before.

@AArnott
Contributor

AArnott commented Dec 12, 2016

Confirmed: running with path normalization on a pdb in a very large git repo

  • GitLink (last public release): 1.7 seconds
  • PdbGit (my GitLink fork): 62 seconds
  • PdbGit w/ perf fix: 0.67 seconds

So we're looking at a 60% perf improvement from original GitLink!

@AArnott
Contributor

AArnott commented Dec 12, 2016

Oh, and to be clear: PdbGit is very close to re-integration with GitLink, so GitLink 3.x should have the perf fix, along with many other improvements.

@AArnott
Contributor

AArnott commented Jan 14, 2017

@jaredpar Have you noticed that source server support doesn't actually work on your official Roslyn builds, built with VSTS?

You're probably going to want to either switch to PdbGit (at least temporarily) or update GitLink after it integrates GitTools/GitLink#137.

@GeertvanHorrik

Just released 2.4.1 with GitLink#137.

@AArnott
Contributor

AArnott commented Jan 15, 2017

Awesome. :)

@Pilchie
Member

Pilchie commented Jan 16, 2017

I actually found that it was working when I was looking at very recent Watson crash dumps. It was very nice to not have to go hunting for source for a change.

So happy to finally have this.

@jaredpar
Member Author

FYI: #16544

7 participants